Collection of Tools & Utilities

Text File | 1993-08-26 | 13KB | 310 lines

K A N J I D I C
===============

Introduction
------------

Kanjidic contains comprehensive information about the Japanese kanji
characters. It is a text file, currently 6,353 lines long, with one line
for each kanji in the two levels of the JIS X 0208-1983 set. (For
information about this set, see Appendix 1.) Eventually it will be
upgraded to the JIS X 0208-1990 version.

The file contains a mixture of ASCII characters and kana/kanji encoded
using the EUC (Extended Unix Code) coding.

Contents & Format
-----------------

The first part of each line is of a fixed format, indicating which
character the line is for, while the rest is more free-format.

The first two bytes are the kanji itself. There is then a space, the
4-byte ASCII representation of the hexadecimal coding of the two-byte JIS
encoding, and another space.

The rest of the line is composed of a combination of three kinds of
fields (which may be in any order and interspersed):

1) Readings (with '-' to indicate prefixes/suffixes, and '.' to separate
   a reading from its okurigana). ON-yomi are in katakana, while KUN-yomi
   are in hiragana.

2) English translations and/or notes. Each such field begins with an open
   brace '{' and ends at the next close brace '}'.

3) Information fields, beginning with an identifying letter and ending
   with a space. There are currently a variety of predefined fields
   (programs using kanjidic should not make any assumptions about the
   presence or absence of any of these fields, as kanjidic is certain to
   be extended in the future):

   B<num> -- The radical (Bushu) number. There is at least one per line.
        As far as possible, this is the radical number used in Nelson.
        Where the classical or historical radical number differs from
        this, it is present as a separate C<num> entry. There should be
        only one B<num> field per line.

   C<num> -- The historical or classical radical number (where this
        differs from the B<num> entry). There may be zero, one or
        several of these.

   F<num> -- The frequency-of-use ranking. At most one per line. The
        2,135 most-used characters have a ranking. Characters that lack
        this field are not ranked.

   G<num> -- The Jouyou grade level. At most one per line. G1 through G6
        indicate Jouyou grades 1-6. G8 indicates general-use characters.
        G9 indicates Jinmeiyou ("for use in names") characters. If not
        present, the kanji is outside these categories.

   H<num> -- The index number in Jack Halpern's dictionary. At most one
        per line. If not present, the character is not in Halpern.

   N<num> -- The index number in the Nelson dictionary. At most one per
        line. If not present, the character is not in Nelson, or is
        considered to be a non-standard version, in which case there
        will be {see Nnnn} appended.

   P<code> -- The SK*P pattern code (similar to Halpern). The <code> is
        of the form "P<num>-<num>-<num>". See Halpern for a description
        of his SKIP pattern code, which is similar to this. A brief
        summary of the method is in Appendix 3.
        [NB: the Pn-n-n codes have been removed from kanjidic as of
        4 August 1993. They were removed to avoid violating Mr Halpern's
        copyright on this list of codes.]

   S<num> -- The stroke count. At least one per line. If there is more
        than one, the first is considered the accepted count, while
        subsequent ones are common miscounts.

   U<hexnum> -- The Unicode encoding of the kanji. Exactly one per line.
        See Appendix 2 for further information on this.

   Qnnnn.n -- The "Four Corner" code for the kanji. This is a rather old
        code used in China and Japan. In some cases there are two of
        these codes, as the coding is a little ambiguous.

   MNnnnnnnn and MPnn.nnnn -- The index number and volume.page
        respectively of the kanji in the 13-volume Morohashi
        "DaiKanWaJiten".

   Ennnn -- The index number used in "A Guide To Remembering Japanese
        Characters" by Kenneth G. Henshall. There are 1945 kanji with
        these numbers (i.e. the Jouyou subset).

   Yxxxxx -- The "PinYin" of each kanji, i.e. the (Mandarin or Beijing)
        Chinese romanization. About 6,000 of the kanji have these.
        Obviously the native Japanese kokuji do not have PinYin.

(Many of the kanji also have indices for the Spahn & Hadamitzky
dictionary. At present they are encoded in the "meaning" field, but will
shortly be moved to the index region of the records.)

If the final field of a line is not an English field, there is a final
space. Each reading and info field is therefore bracketed by spaces
(which makes it convenient for grep-based searches).

As far as possible all entries will have their yomikata and readings
attached, even if they are a recognized variant of another kanji. This is
to facilitate electronic searches using these fields as keys, and should
not be taken as a recommendation to use such obscure kanji.

Usage
-----

Kanjidic is now used to build the "kinfo.dat" file which is used by JDIC
and JREADER, and by Stephen Chung's JWP. "kinfo.dat" contains the
identical information, but in a compressed form and in a structure
suitable for fast indexed access.

Kanjidic is also used in the XJDIC program.

Support
-------

Kanjidic was originally compiled, and is maintained, by:

        Jim Breen (jwb@capek.rdt.monash.edu.au)
        Department of Robotics & Digital Technology
        Monash University, Victoria, Australia

If you have changes, send diffs [not complete files] with corrections to
him.

Too Much Information?
---------------------

Kanjidic is now rather large, and has information in it which is of
little use to people who are not studying or researching Japanese
orthography. It is still appropriate to maintain it as a useful
compendium of such information in the Public Domain. For people who only
wish to use a subset of the information in kanjidic, there is a program
"kdfilt.c", also available as kdfilt.exe for MS-DOS, which will strip out
unwanted fields.

History (comments by Jim Breen)
-------

Kanjidic began as two files: jis1detl.lst and jis2detl.lst.
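
As a practical aside on the record layout described under Contents &
Format, the following is a minimal Python sketch of how a program might
split one line into its fixed part and its three kinds of fields. It is
not taken from any of the programs named in this document. The function
names, the sample line and its field values are assumptions made for
illustration only, and the sketch assumes Python's "euc_jp" codec is a
suitable decoder for the file.

```python
import re
from collections import defaultdict

# Hypothetical sample in the layout described under Contents & Format;
# the field values are illustrative and not copied from the file.
SAMPLE = "亜 3021 B1 C7 G8 S7 U4e9c ア つ.ぐ {Asia} {rank next} "

# An English field runs from '{' to the next '}'.
MEANING = re.compile(r"\{([^}]*)\}")


def parse_kanjidic_line(line):
    """Split one decoded kanjidic line into its fixed part and field kinds."""
    kanji, jis_hex, rest = line.rstrip("\n").split(" ", 2)
    # English fields may contain spaces, so extract them before tokenizing.
    meanings = MEANING.findall(rest)
    rest = MEANING.sub(" ", rest)
    readings = []
    info = defaultdict(list)  # a field letter may occur more than once (C, S)
    for token in rest.split():
        if token[0].isascii() and token[0].isalpha():
            # Morohashi fields use the two-letter prefixes MN and MP.
            key = token[:2] if token[:2] in ("MN", "MP") else token[0]
            info[key].append(token[len(key):])
        else:
            # Kana readings, possibly with a leading or trailing '-'.
            readings.append(token)
    return {"kanji": kanji, "jis": jis_hex, "readings": readings,
            "meanings": meanings, "info": dict(info)}


def read_kanjidic(path):
    """Yield parsed entries from an EUC-encoded file at a caller-chosen path."""
    with open(path, encoding="euc_jp") as handle:
        for line in handle:
            if line.strip():
                yield parse_kanjidic_line(line)


entry = parse_kanjidic_line(SAMPLE)
print(entry["info"]["S"][0], entry["readings"], entry["meanings"])
```

Unknown field letters are kept rather than rejected, in line with the
advice that programs should not assume which fields are present.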

The first file was compiled initially from the file "kinfo.dat" supplied
by Stephen Chung, who in turn compiled his file from a file prepared by
Mike Erickson. I originally added about 1900 "meanings" by James Heisig,
keyed in by Kevin Moore from the book "Remembering The Kanji". I later
added the ex-Nelson meanings from Rik Smoody's files, compiled when he
was working for Sony in Japan.

The second file was compiled from a complete JIS2 list with Bushu and
stroke counts kindly supplied to me by Jon Crossley, to which I added
Nelson numbers, yomikata and meanings extracted from a dictionary file
prepared by Rik Smoody at Sony.

The file is being continually updated with extra and corrected yomikata,
Nelson nos, meanings, etc. Theresa Martin has been of great assistance
with this, particularly with tracking down and correcting many
mistranscribed yomikata (the old zu/dzu, oo/ou, ji/dji, etc. problems).

Jeffrey Friedl did a major overhaul in September-October 1992, in which
he added frequency rankings, Halpern codes and SK*P patterns, updated the
grading ("G" fields) to reflect the modern Jouyou lists, corrected
radical numbers, and corrected stroke counts and readings to fall in line
with modern usage.

Magnus Halldorsson corrected some erroneous Halpern numbers, and provided
them for a lot of the radicals.

Lee Collins provided the Unicode mappings (see Appendix 2).

Iain Sinclair has provided the yomikata, meanings and S&H indices of many
of the obscure JIS2 kanji.

Christian Wittern, a Sinologist working at Kyouto U, sent me a monster
file prepared and released by Dr Urs App from Hanazono University. From
this I have extracted the "Four Corner", Morohashi and PinYin
information. I am very grateful for this significant contribution.

Alfredo Pinochet supplied all the Henshall numbers.

In July 1993, aft